Freedom to act (Mushin)

Ben Whalley, Paul Sharpe, Sonja Heintz

Overview

We think the analogy to using R is clear:

  • If you are anxious, stressed or avoidant you will be distracted
  • Getting confident with the basics makes more complex techniques possible

TODO: replace with feelgood video

In this session we cover:

  • Loading data from files
  • Using simple techniques to answer research questions with data
  • Saving intermediate steps using variables

Principles/ideas

  • Using data to answer questions
  • Precision and literal-mindedness of R
  • Paths and directories

Storing data in variables

TODO: replace with video

Video summary:

  • In R, a variable is the name for a container which stores data.
  • We make variables using <-, which is called the assignment operator.
  • Values on the right hand side of <- are stored in the variable on the left hand side.
  • Variables that you create are stored in the Global Environment, which you can see using the Environment pane.
# calculate 40 + 2 and assign the result to a variable
meaning_of_life <- 40 + 2
# print variable
meaning_of_life
[1] 42

As we work, it’s useful to be able to save the results of the code we write.

As one example, we might have a dataset with multiple columns, each holding participants’ answers to an individual questionnaire item. We might want to calculate a new column (maybe an average of each person’s scores on all of the questions) and keep track of this so we can use it in later calculations.

Alternatively, we might want to save the result of a specific calculation and use it later on.

To do this we can create a variable.

A variable is just a container to store data in. To make variables we use the assignment operator, which looks like this <-

That is, like an arrow that points to the left. This is a reminder that the results of the calculation on the right hand side will be assigned (stored) in the variable on the left hand side.

The code in this chunk runs the calculation on the right hand side of the assignment operator, 40 + 2, and assigns the result to a new variable named meaning_of_life. The output of the chunk is 42, the value of meaning_of_life.

Give your variables short names which describe the data they contain. Use the underscore _ if you need to use more than one word e.g. meaning_of_life.

You might wonder where these variables get saved. In most cases, variables you create are stored in what’s called the Global Environment. You can see them in the Environment pane in RStudio. Double-clicking on any variable there will show you what is stored inside the container.
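
As a small sketch of why saving results is useful, the chunk below stores a value and then reuses it in a second calculation. The name double_meaning is made up for this illustration.

```r
# store the result of a calculation in a variable
meaning_of_life <- 40 + 2

# use the stored value in a later calculation
# (double_meaning is a made-up name for this example)
double_meaning <- meaning_of_life * 2

# print the new variable
double_meaning
```

After running this, both variables should appear in the Environment pane.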

Exercise 1

  1. Open session-2.rmd using the Files pane. This is the workbook you will be using in this session.
  2. Run the first chunk in the workbook.

The output should look like this:

Results of creating meaning_of_life variable

Your Environment pane should look like this:

Environment pane after creating meaning_of_life variable

Exercise 2

  1. Create a level 3 markdown heading named “Exercise 2” in your workbook
  2. Create a new chunk beneath the heading
  3. Assign the results of the calculation 2 * 35 to the variable seventy
  4. Run the chunk

Your Environment should now look like this:

Environment pane after creating seventy

Exercise 3

  1. Use R to calculate your age in the year 2051.
  2. Save the result in a variable with a descriptive name.

Passing data to commands using the pipe %>%

TODO: replace with video

Video summary:

  • The pipe command %>% sends data from one piece of code to another.
  • You can use the assignment operator to store the results of a pipeline in a variable.
# pipe mtcars into head()
mtcars %>% head()
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

# store first few rows of mtcars
mtcars_head <- mtcars %>%
  head()

Sometimes we need to link together multiple steps in our analysis.

For example, if we’re working with a big dataset we might want to select only some of the columns, then filter out some of the rows of data, and then finally calculate descriptive statistics.

We could do this by creating lots of variables, each one saving the results at each intermediate step. This can get confusing, though.

Instead we can use what’s known as a ‘pipe’ — it’s another way to link together multiple instructions.

The pipe sends data from one piece of code to another.

The pipe looks like this %>%.

In session 1, you used this command to “pipe” the mtcars dataset into head, which shows just the first few rows:

mtcars %>% head()
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

You can think of your data as flowing along lengths of pipe, joined by commands which do things to the data, step by step, until the result you want plops out at the end.

Each %>% should be read as the word “then”, e.g. “take the mtcars data, then head it”.

The > in the pipe command reminds you of the direction in which your data is flowing (it only works left to right).
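
One way to check your understanding is to compare the piped version with the ordinary way of calling head(). This is only a sketch; it assumes the dplyr package (part of the tidyverse) is installed, because that is where %>% comes from.

```r
library(dplyr)  # provides the %>% pipe

# piping mtcars into head() ...
mtcars %>% head()

# ... gives the same result as putting mtcars inside the brackets
head(mtcars)

# check that the two results really are the same
identical(mtcars %>% head(), head(mtcars))
```

The last line prints TRUE: the pipe simply hands the data on its left to the command on its right.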

It’s important to know that the pipe command doesn’t store the results of these steps.

Sometimes that’s OK. In our first example we just wanted to look at the first few rows of the mtcars data.

But you will usually want to save the result of a pipeline in a new variable.
For example, if we wanted to save the first few rows of the mtcars data to a new variable we would write:

mtcars_head <- mtcars %>% head()

Here we combine assignment with a pipeline.

The result of the pipeline (a data.frame containing the first few rows of mtcars) is saved to a new variable called mtcars_head.
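
The same pattern works when a command is given extra settings. As a sketch, the chunk below asks head() for 3 rows rather than the default 6 and saves them. The variable name mtcars_top3 is just an illustration.

```r
library(dplyr)  # provides the %>% pipe

# keep only the first 3 rows of mtcars and save them
# (mtcars_top3 is a made-up name for this example)
mtcars_top3 <- mtcars %>%
  head(3)

# count the rows in the saved data.frame
nrow(mtcars_top3)
```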

You can explore your variables using the Environment pane. A data.frame will have an icon that looks like a spreadsheet. If you [click on the icon], the data.frame is displayed in a new tab in the Source pane.

This tab shows you the same information as printing the data.frame, such as the number of rows and columns, but it also provides tools for exploring the data interactively.

  • The arrows next to the column names allow you to arrange the rows in ascending or descending order based on the column values.
  • The Filter button allows you to specify a value for one or more columns to filter out non-matching rows. For example, we could display just cars with 4 gears. Click the button again to turn off the filter.
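
The Filter button has a code equivalent, filter(), which is covered later in this session. As a rough preview, and assuming dplyr is installed, the "cars with 4 gears" example could be written like this (mtcars_4_gears is a made-up name):

```r
library(dplyr)

# keep only the rows of mtcars where the gear column equals 4
mtcars_4_gears <- mtcars %>%
  filter(gear == 4)

# how many cars have 4 gears?
nrow(mtcars_4_gears)
```

Unlike the Filter button, this leaves a record in your workbook of exactly which rows you kept.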

Exercise 3

  1. Create a level 3 markdown heading named “Exercise 3” in your workbook. (You should be used to doing this for every exercise by now, so we won’t remind you again.)
  2. Create a new chunk beneath the heading
  3. Load the tidyverse library
  4. Pipe the mpg data.frame into head() and assign the results to a variable called mpg_head
  5. Use the Environment pane to open mpg_head

In 1999, a 6 cylinder, manual transmission Audi A4 could cover ____ miles per gallon when driven in the city.

Loading data from elsewhere

TODO: replace with video

  • Often we want to load data into R (not just use built in data)
  • The preferred format for data files in R is comma-separated values (CSV)
  • CSV data can be read using the read_csv command
  • You can load data from an internet address (URL) or a file uploaded to the server

Loading data

In a lot of these sessions we use datasets that are built-in to R because it’s quick and convenient to illustrate the points we make.

[demo opening glancing some built in data like gapminder, iris, mtcars etc]

Normally, though, you will need to load your own data.

R can read data from two places:

  • A URL (web address), if the data file is available on the internet somewhere
  • A file on the computer that R is running on

The link below is a URL (web address) for a file containing data about US police shootings.

The final part of the url tells us the name of the file: shootings.csv

The final 3 (sometimes 4) letters of the filename are called the file extension.

Here the file extension is .csv, which stands for ‘comma separated values’ or CSV.

CSV is a common data type. Most data-oriented programmes (e.g. Excel or Open Office or SPSS) can read and write .csv files, so it’s a good choice for storing and sharing data.

If you click on the link [click link in vid] you’ll see the first line is a list of column names separated by commas.

The remaining lines contain rows of data matching the column headings. For example, the value of the arms_category column in row 1 is Guns.

The read_csv() command reads a CSV file, and converts it to a data.frame, which is the format we use in R.

We can use read_csv() to load data from either a file, or over the internet, which is shown in the next video.
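
To make the CSV layout concrete, here is a minimal sketch that writes a tiny made-up CSV to a temporary file and reads it back. The data, and the names csv_lines, example_path and people, are invented for this illustration; it assumes the readr package (part of the tidyverse) is installed.

```r
library(readr)  # provides read_csv()

# a tiny, made-up CSV: the first line is the column names,
# the remaining lines are rows of data
csv_lines <- c(
  "name,age",
  "Ada,36",
  "Grace,45"
)

# write the lines to a temporary file so there is something to read
example_path <- tempfile(fileext = ".csv")
writeLines(csv_lines, example_path)

# read_csv() converts the file into a data.frame
people <- read_csv(example_path)
people
```

The result has two rows and two columns, named after the first line of the file.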

Reading CSV files from the internet

TODO: replace with video

Video summary:

  • read_csv('http://...') can load data from a URL
  • it converts the data to a data.frame
  • you must assign the loaded data to a variable (and give it a descriptive name)
  • once loaded, you can view the data using the Environment pane

CSV files are a common format to store and share data. As shown in the previous video, the first line of a CSV file gives the column names, and the remaining lines are rows of data.

The read_csv() command reads a CSV file, and converts it to a data.frame, which is the format we use in R. We can load data either from a file, or over the internet.

In this example, I’m reading a CSV directly over the Internet and storing the resulting data.frame in a variable.

The URL (the link to the CSV file) needs to be in quotes (single or double quotes both work).

shootings <- read_csv('https://benwhalley.github.io/lifesavR/lifesavr/shootings.csv')

Because we made a new variable, the result is stored in the Environment, and we can double-click it to have a look at the data.

An alternative (and recommended) way is to type the name of the variable as a very simple command:

shootings
# A tibble: 4,895 x 15
      id name   date       manner_of_death  armed   age gender race  city  state
   <dbl> <chr>  <date>     <chr>            <chr> <dbl> <chr>  <chr> <chr> <chr>
 1     3 Tim E… 2015-01-02 shot             gun      53 M      Asian Shel… WA   
 2     4 Lewis… 2015-01-02 shot             gun      47 M      White Aloha OR   
 3     5 John … 2015-01-03 shot and Tasered unar…    23 M      Hisp… Wich… KS   
 4     8 Matth… 2015-01-04 shot             toy …    32 M      White San … CA   
 5     9 Micha… 2015-01-04 shot             nail…    39 M      Hisp… Evans CO   
 6    11 Kenne… 2015-01-04 shot             gun      18 M      White Guth… OK   
 7    13 Kenne… 2015-01-05 shot             gun      22 M      Hisp… Chan… AZ   
 8    15 Brock… 2015-01-06 shot             gun      35 M      White Assa… KS   
 9    16 Autum… 2015-01-06 shot             unar…    34 F      White Burl… IA   
10    17 Lesli… 2015-01-06 shot             toy …    47 M      Black Knox… PA   
# … with 4,885 more rows, and 5 more variables: signs_of_mental_illness <lgl>,
#   threat_level <chr>, flee <chr>, body_camera <lgl>, arms_category <chr>
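
Printing shows only the first ten rows. A few other commands give a quick overview of any data.frame. In this sketch the built-in mtcars data is used as a stand-in so the chunk runs without an internet connection; the same commands should work on shootings once it has been loaded.

```r
# mtcars is a stand-in here; swap in shootings after loading it
nrow(mtcars)   # number of rows
ncol(mtcars)   # number of columns
names(mtcars)  # column names
```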

EXERCISES HERE….

Using data from your computer

VIDEO (covers these things)

  • To use data from your computer you need to upload it to the server
  • The files pane allows this
  • Once on the server, read_csv needs a path
  • Always store data next to your R code
  • you must assign the loaded data to a variable (and give it a descriptive name)
  • once loaded, you can view the data using the Environment pane

XXX TODO TIDY THIS

The British School Success Survey is a dataset used to predict school performance.

Exercise 4

  1. Copy the code above into a new chunk in your workbook and run it
  2. Add a line that reads schoolpredict.csv into the variable school_predict
  3. Open school_predict in the Source window

The school with the most teaching assistants has ____ teachers.

The Files pane

You can also use read_csv() to read CSV files from within RStudio.

The CSV files which you read from the Internet are also included in the folder which contains your workbooks. [You can see them using the Files pane.]

The command shootings <- read_csv('shootings.csv') reads the US police shootings data from the same folder as the R Markdown file into shootings.

The Upload button in the Files pane lets you upload a file from your computer to RStudio. RStudio uses file extensions to guess what the file contains. A file extension is a sequence of characters, starting with a ., at the end of a file name.

  • .csv - CSV file
  • .rmd - R Markdown file

Make sure that any file you upload has the correct file extension.

We’ll upload a CSV file which contains data about students who took a maths class.

  1. Click the Upload button.
  2. Ensure the Target directory is where you want the uploaded file to appear. For this module it should read ~/lifesavr. The ~ (pronounced “tilde”) means your Home directory on the RStudio server. The /lifesavr means the folder named lifesavr in Home.
  3. Click the Choose file button and select the file you want to upload. After you select a file, its name appears next to the button.
  4. Click the OK button.

The file should appear in the Files pane in your lifesavr folder.
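
You can also check from code that a file is where R expects it to be. This is a sketch using commands built in to R; it assumes your working directory is the folder containing your workbook, and uses shootings.csv purely as an example file name.

```r
# show the folder R is currently working in
getwd()

# list the CSV files in that folder
list.files(pattern = "\\.csv$")

# check whether a particular file is there (prints TRUE or FALSE)
file.exists("shootings.csv")
```

If file.exists() prints FALSE, read_csv('shootings.csv') will fail too, so check the file name and the folder before anything else.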

Exercise 5

  1. Open https://benwhalley.github.io/lifesavR/data/student-mat.csv in a web browser
  2. Use your web browser to save the file as student-mat.csv
  3. Upload student-mat.csv to the folder containing your workbooks
  4. Read student-mat.csv into the variable student_maths
  5. Open student_maths in the Source window

The oldest student drinks ____ units of alcohol (column Walc) at the weekend.

Selecting rows with filter()

TODO: replace with video

Video summary:

  • filter() selects rows from a dataset which match criteria we set
  • the simplest way to filter is to use ==, to test if the row is an exact match
  • we can use other filters like < or > too, for numeric columns
  • we can combine multiple filters to get exactly the rows we need
## code from video with summary as comments
# load the packages used below: tidyverse for %>% and filter(), gapminder for the data
library(tidyverse)
library(gapminder)

# filter rows where country is equal to the word "Kenya"
# remember to use two = signs together, ==
gapminder %>% 
  filter(country == "Kenya")
# A tibble: 12 x 6
   country continent  year lifeExp      pop gdpPercap
   <fct>   <fct>     <int>   <dbl>    <int>     <dbl>
 1 Kenya   Africa     1952    42.3  6464046      854.
 2 Kenya   Africa     1957    44.7  7454779      944.
 3 Kenya   Africa     1962    47.9  8678557      897.
 4 Kenya   Africa     1967    50.7 10191512     1057.
 5 Kenya   Africa     1972    53.6 12044785     1222.
 6 Kenya   Africa     1977    56.2 14500404     1268.
 7 Kenya   Africa     1982    58.8 17661452     1348.
 8 Kenya   Africa     1987    59.3 21198082     1362.
 9 Kenya   Africa     1992    59.3 25020539     1342.
10 Kenya   Africa     1997    54.4 28263827     1360.
11 Kenya   Africa     2002    51.0 31386842     1288.
12 Kenya   Africa     2007    54.1 35610177     1463.

# filter only rows where year is greater than 2000
gapminder %>% 
  filter(year > 2000)
# A tibble: 284 x 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       2002    42.1 25268405      727.
 2 Afghanistan Asia       2007    43.8 31889923      975.
 3 Albania     Europe     2002    75.7  3508512     4604.
 4 Albania     Europe     2007    76.4  3600523     5937.
 5 Algeria     Africa     2002    71.0 31287142     5288.
 6 Algeria     Africa     2007    72.3 33333216     6223.
 7 Angola      Africa     2002    41.0 10866106     2773.
 8 Angola      Africa     2007    42.7 12420476     4797.
 9 Argentina   Americas   2002    74.3 38331121     8798.
10 Argentina   Americas   2007    75.3 40301927    12779.
# … with 274 more rows

# show rows with low life expectancy
gapminder %>% 
  filter(lifeExp < 35)
# A tibble: 33 x 6
   country      continent  year lifeExp      pop gdpPercap
   <fct>        <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan  Asia       1952    28.8  8425333      779.
 2 Afghanistan  Asia       1957    30.3  9240934      821.
 3 Afghanistan  Asia       1962    32.0 10267083      853.
 4 Afghanistan  Asia       1967    34.0 11537966      836.
 5 Angola       Africa     1952    30.0  4232095     3521.
 6 Angola       Africa     1957    32.0  4561361     3828.
 7 Angola       Africa     1962    34    4826015     4269.
 8 Burkina Faso Africa     1952    32.0  4469979      543.
 9 Burkina Faso Africa     1957    34.9  4713416      617.
10 Cambodia     Asia       1977    31.2  6978607      525.
# … with 23 more rows

# combine multiple filters
gapminder::gapminder %>% 
  filter(country=="Kenya") %>% 
  filter(year > 2000) %>% 
  filter(lifeExp < 35)
# A tibble: 0 x 6
# … with 6 variables: country <fct>, continent <fct>, year <int>,
#   lifeExp <dbl>, pop <int>, gdpPercap <dbl>
library(gapminder)
gapminder %>% filter(country == "Kenya")
# A tibble: 12 x 6
   country continent  year lifeExp      pop gdpPercap
   <fct>   <fct>     <int>   <dbl>    <int>     <dbl>
 1 Kenya   Africa     1952    42.3  6464046      854.
 2 Kenya   Africa     1957    44.7  7454779      944.
 3 Kenya   Africa     1962    47.9  8678557      897.
 4 Kenya   Africa     1967    50.7 10191512     1057.
 5 Kenya   Africa     1972    53.6 12044785     1222.
 6 Kenya   Africa     1977    56.2 14500404     1268.
 7 Kenya   Africa     1982    58.8 17661452     1348.
 8 Kenya   Africa     1987    59.3 21198082     1362.
 9 Kenya   Africa     1992    59.3 25020539     1342.
10 Kenya   Africa     1997    54.4 28263827     1360.
11 Kenya   Africa     2002    51.0 31386842     1288.
12 Kenya   Africa     2007    54.1 35610177     1463.

This chunk filters the gapminder dataset to include only rows where the country column equals “Kenya”.

The == is called an “operator”. It compares values from the column on the left hand side with the value specified on the right hand side. The value must match the column type. The value "Kenya" is in quotes because the country column holds text (R stores it as a factor, its type for categories).
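
To see what == does on its own, outside filter(), here is a small sketch using a made-up list of country names (countries is an invented variable):

```r
# a made-up set of values to compare
countries <- c("Kenya", "Norway", "Kenya")

# == checks each value in turn and answers TRUE or FALSE
countries == "Kenya"

# R is literal-minded: capital letters must match exactly,
# so this gives FALSE for every value
countries == "kenya"
```

filter() keeps the rows where the answer is TRUE and drops the rest.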

The > operator

The “greater than” operator > filters numeric data.

gapminder %>% filter(year > 2000)
# A tibble: 284 x 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       2002    42.1 25268405      727.
 2 Afghanistan Asia       2007    43.8 31889923      975.
 3 Albania     Europe     2002    75.7  3508512     4604.
 4 Albania     Europe     2007    76.4  3600523     5937.
 5 Algeria     Africa     2002    71.0 31287142     5288.
 6 Algeria     Africa     2007    72.3 33333216     6223.
 7 Angola      Africa     2002    41.0 10866106     2773.
 8 Angola      Africa     2007    42.7 12420476     4797.
 9 Argentina   Americas   2002    74.3 38331121     8798.
10 Argentina   Americas   2007    75.3 40301927    12779.
# … with 274 more rows

This chunk filters rows where year is greater than 2000.

The < operator

The opposite of the > operator is the “less than” operator <. It keeps rows where a numeric column is less than a given value.
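
Here is a sketch of < in use. It uses the built-in mtcars data rather than gapminder, so it does not give away the exercise below. It assumes dplyr is installed, and thirsty_cars is a made-up name.

```r
library(dplyr)

# keep only cars that do fewer than 15 miles per gallon
thirsty_cars <- mtcars %>%
  filter(mpg < 15)

thirsty_cars
```

A car with exactly 15 miles per gallon is not included, because 15 is not less than 15.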

Exercise 6

Filter gapminder to show rows where life expectancy is less than 35.

The results should look like this:

# A tibble: 33 x 6
   country      continent  year lifeExp      pop gdpPercap
   <fct>        <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan  Asia       1952    28.8  8425333      779.
 2 Afghanistan  Asia       1957    30.3  9240934      821.
 3 Afghanistan  Asia       1962    32.0 10267083      853.
 4 Afghanistan  Asia       1967    34.0 11537966      836.
 5 Angola       Africa     1952    30.0  4232095     3521.
 6 Angola       Africa     1957    32.0  4561361     3828.
 7 Angola       Africa     1962    34    4826015     4269.
 8 Burkina Faso Africa     1952    32.0  4469979      543.
 9 Burkina Faso Africa     1957    34.9  4713416      617.
10 Cambodia     Asia       1977    31.2  6978607      525.
# … with 23 more rows

Combined filters

gapminder::gapminder %>% 
  filter(country=="Kenya") %>% 
  filter(year > 2000)
# A tibble: 2 x 6
  country continent  year lifeExp      pop gdpPercap
  <fct>   <fct>     <int>   <dbl>    <int>     <dbl>
1 Kenya   Africa     2002    51.0 31386842     1288.
2 Kenya   Africa     2007    54.1 35610177     1463.
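
An alternative way to combine filters is to put several conditions inside one filter(), separated by commas. The sketch below assumes the dplyr and gapminder packages are installed; two_steps and one_step are made-up names.

```r
library(dplyr)

# two filter() steps joined by pipes
two_steps <- gapminder::gapminder %>%
  filter(country == "Kenya") %>%
  filter(year > 2000)

# one filter() step with both conditions, separated by a comma
one_step <- gapminder::gapminder %>%
  filter(country == "Kenya", year > 2000)

one_step
```

Both versions keep the same two rows, so which one you use is a matter of taste.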

Sorting data using arrange()

remind them they know how to make scatter and boxplots

  • “What is the size of the largest diamond (by carat) in the diamonds dataset?”
  • “What cut were the three largest diamonds in that dataset?”

TODO: replace with video

diamonds %>% arrange(-carat) %>% head(3)
# A tibble: 3 x 10
  carat cut   color clarity depth table price     x     y     z
  <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  5.01 Fair  J     I1       65.5    59 18018  10.7 10.5   6.98
2  4.5  Fair  J     I1       65.8    58 18531  10.2 10.2   6.72
3  4.13 Fair  H     I1       64.8    61 17329  10    9.85  6.43
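
The minus sign is a shortcut that works for numeric columns. dplyr also provides desc(), which says the same thing in words and also works for text columns, where a minus sign does not. A sketch, assuming dplyr and ggplot2 (which contains the diamonds data) are installed; largest_diamonds is a made-up name:

```r
library(dplyr)

# sort by carat from largest to smallest, then keep the top 3 rows
largest_diamonds <- ggplot2::diamonds %>%
  arrange(desc(carat)) %>%
  head(3)

largest_diamonds
```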

Combine filtering and sorting

TODO: replace with video

In which year did Kenyans have the lowest life expectancy?

gapminder::gapminder %>% filter(country=="Kenya") %>% 
  arrange(lifeExp) %>% 
  head(6)
# A tibble: 6 x 6
  country continent  year lifeExp      pop gdpPercap
  <fct>   <fct>     <int>   <dbl>    <int>     <dbl>
1 Kenya   Africa     1952    42.3  6464046      854.
2 Kenya   Africa     1957    44.7  7454779      944.
3 Kenya   Africa     1962    47.9  8678557      897.
4 Kenya   Africa     1967    50.7 10191512     1057.
5 Kenya   Africa     2002    51.0 31386842     1288.
6 Kenya   Africa     1972    53.6 12044785     1222.

In which year was life expectancy highest? All that changes is the minus sign, which reverses the sort order:

gapminder::gapminder %>% 
  filter(country=="Kenya") %>% 
  arrange(-lifeExp) 
# A tibble: 12 x 6
   country continent  year lifeExp      pop gdpPercap
   <fct>   <fct>     <int>   <dbl>    <int>     <dbl>
 1 Kenya   Africa     1987    59.3 21198082     1362.
 2 Kenya   Africa     1992    59.3 25020539     1342.
 3 Kenya   Africa     1982    58.8 17661452     1348.
 4 Kenya   Africa     1977    56.2 14500404     1268.
 5 Kenya   Africa     1997    54.4 28263827     1360.
 6 Kenya   Africa     2007    54.1 35610177     1463.
 7 Kenya   Africa     1972    53.6 12044785     1222.
 8 Kenya   Africa     2002    51.0 31386842     1288.
 9 Kenya   Africa     1967    50.7 10191512     1057.
10 Kenya   Africa     1962    47.9  8678557      897.
11 Kenya   Africa     1957    44.7  7454779      944.
12 Kenya   Africa     1952    42.3  6464046      854.

Combining rows using summarise()

TODO: replace with video

  • Often you have lots of data and need to make summaries of it — e.g. to calculate the average of a column
  • The summarise() function takes many rows and uses a function to convert those into fewer rows.
  • We can use many different functions with summarise(), but common choices are functions for descriptive statistics, like mean, median, or sd (short for standard deviation)
mtcars %>% summarise(average_mpg = mean(mpg))
  average_mpg
1    20.09062
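
summarise() can calculate several summaries in one go, each with its own name. A sketch, assuming dplyr is installed; the column names and mpg_summary are made up for this example:

```r
library(dplyr)

# calculate three descriptive statistics for the mpg column at once
mpg_summary <- mtcars %>%
  summarise(
    average_mpg = mean(mpg),
    median_mpg  = median(mpg),
    sd_mpg      = sd(mpg)
  )

mpg_summary
```

The result is a single row with one column per summary.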

Using filter() and summarise() together

  • Using the pipe (%>%), we can combine multiple steps
  • It’s common to want to filter out certain rows, before using summarise
mtcars %>% 
  filter(am==1) %>% 
  summarise(mean(mpg))
  mean(mpg)
1  24.39231
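
In the output above the column is called mean(mpg), which is awkward to refer to later. Giving the summary a name, as in the earlier example, avoids this. A sketch, assuming dplyr is installed (manual_mpg and average_mpg are made-up names):

```r
library(dplyr)

# filter to manual cars (am equal to 1), then name the summary column
manual_mpg <- mtcars %>%
  filter(am == 1) %>%
  summarise(average_mpg = mean(mpg))

manual_mpg
```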

Grouping results with group_by

TODO: replace with video

  • In our data we may have categorical variables (e.g. gender, or country)
  • We often want to compute summaries for each group
  • Using filter(), we could make a summary for each group, one by one; the group_by function does this for us
  • If you add group_by() to a pipeline then all the subsequent steps are run once for each group
  • Be careful only to group by categorical variables

We might make a plot like this:

mtcars %>% 
  ggplot(aes(factor(cyl), mpg)) + 
  geom_boxplot()

But what if we want these numbers in a table (or to include them in a report)? We can do that using group_by and summarise…

mtcars %>% 
  group_by(cyl) %>% 
  summarise(average_mpg = mean(mpg))
# A tibble: 3 x 2
    cyl average_mpg
* <dbl>       <dbl>
1     4        26.7
2     6        19.7
3     8        15.1
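
When reporting group averages it often helps to show how many rows each group contains. dplyr’s n() counts the rows in each group. A sketch, assuming dplyr is installed; mpg_by_cyl and n_cars are made-up names:

```r
library(dplyr)

# average mpg and number of cars for each number of cylinders
mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    average_mpg = mean(mpg),
    n_cars      = n()
  )

mpg_by_cyl
```

A small group makes for a less reliable average, so the count is worth checking before you interpret the means.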

We can also group by two variables at once and get a row for each combination:

mtcars %>% group_by(cyl, am) %>% summarise(mean(mpg))
# A tibble: 6 x 3
# Groups:   cyl [3]
    cyl    am `mean(mpg)`
  <dbl> <dbl>       <dbl>
1     4     0        22.9
2     4     1        28.1
3     6     0        19.1
4     6     1        20.6
5     8     0        15.0
6     8     1        15.4

Check your knowledge

Write an answer to each of these questions in the Check your knowledge section of your workbook. The answers will be revealed in Session 3.

  • What is the %>% symbol called and what does it do?
  • What is the <- symbol called and what does it do?

Practice problems

Additional questions

  • In the gapminder dataset, what country had the highest life expectancy in 1952? (Use arrange, filter and head)
gapminder::gapminder %>% 
  filter(year == 1952) %>% 
  arrange(-lifeExp) %>% 
  head(1)
# A tibble: 1 x 6
  country continent  year lifeExp     pop gdpPercap
  <fct>   <fct>     <int>   <dbl>   <int>     <dbl>
1 Norway  Europe     1952    72.7 3327728    10095.
  • Which continent had the highest average GDP per capita, averaged across all years in the dataset? (Use arrange, group_by, and summarise)
gapminder::gapminder %>% 
  group_by(continent) %>% 
  summarise(average_gdp = mean(gdpPercap)) %>% 
  arrange(-average_gdp)
# A tibble: 5 x 2
  continent average_gdp
  <fct>           <dbl>
1 Oceania        18622.
2 Europe         14469.
3 Asia            7902.
4 Americas        7136.
5 Africa          2194.
  • Make a boxplot showing life expectancy by continent for years after 2000. (Use filter, ggplot and geom_boxplot)
gapminder::gapminder %>% 
  filter(year > 2000) %>% 
  ggplot(aes(continent, lifeExp)) + 
  geom_boxplot()

“Mega problem”

Describe these as the ‘end of level boss characters’. You need to combine all your skills to beat them…

Make a table which shows the average life expectancy for each continent, sorted from highest to lowest:

gapminder::gapminder %>% 
  group_by(continent) %>% 
  summarise(life_expectancy = mean(lifeExp)) %>% 
  arrange(-life_expectancy)
# A tibble: 5 x 2
  continent life_expectancy
  <fct>               <dbl>
1 Oceania              74.3
2 Europe               71.9
3 Americas             64.7
4 Asia                 60.1
5 Africa               48.9

Broken script to fix

  • Fix a ‘broken’ script: Start a NEW R session and make this code work:
liibrary(todyverse)

# make a density plot of of life expectacy with different color lines for each continent
gapminder::gapminder %>% 
  ggplote(aes("lifeExp", colr = "Continent"))  geom_density()

# select only years after 1990
gapminder::gapminder %>% 
  filter(year > 1990)

ggplot(aes(year, lifeExp, color=continent)) + 
  geom_jitter()
  

NOTE - we will know all the errors they will see so can provide hints for each of them

Correct version would be:

library(tidyverse)

# make a density plot of life expectancy with different color lines for each continent
gapminder::gapminder %>% 
  ggplot(aes(lifeExp, color = continent))  + 
  geom_density()


# select only years after 1990
gapminder::gapminder %>% 
  filter(year > 1990) %>%
  ggplot(aes(year, lifeExp, color=continent)) + 
  geom_jitter()